
Paper: Computational Resource Optimisation in Feature Selection under Class Imbalance Conditions #947

Open
wants to merge 122 commits into base: 2024
Conversation

AmadiGabriel (Collaborator) commented Jun 8, 2024

If you are creating this PR in order to submit a draft of your paper, please name your PR with Paper: <title>. An editor will then add a paper label and GitHub Actions will be run to check and build your paper.

See the project readme for more information.

Editor: Chris Calloway @cbcunc

Reviewers:

@ameyxd self-assigned this Jun 8, 2024
@ameyxd added the paper label (indicates that the PR in question is a paper) Jun 8, 2024
@mepa changed the title from "paper: Computational Resource Optimisation in Feature Selection under Class Imbalance Conditions" to "Paper: Computational Resource Optimisation in Feature Selection under Class Imbalance Conditions" Jun 9, 2024
github-actions bot commented Jun 9, 2024

Curvenote Preview

Directory: papers/amadi_udu
Checks: 80 checks passed (4 optional)
Updated (UTC): Jul 10, 2024, 3:11 PM

AmadiGabriel (Collaborator, Author) commented, quoting @apaleyes's review:

Very nicely written paper; the final render (which I was able to fish out of the build log) is beautiful and easy to read.

Many minor comments are below. I also have a general question for the authors: I struggle a bit to understand the main contribution or takeaway from this paper. Let's consider some ideas:

  • Is it the method itself? The method appears to be a combination of several well-established techniques, without too much customisation.
  • Is it the Python implementation of the method? If so, the emphasis should probably be more on the implementation details, software design, and reuse.
  • Is it the analysis of feature selection effects on datasets with class imbalance? If so, the paper needs more unified conclusions and general observations that span beyond a single dataset. It probably also needs to analyse larger datasets.
  • Is it the analysis of the three selected models? If so, the selection of models either needs to be expanded or needs a solid justification (e.g. we used these models because they are the most popular in field X).

I think the paper could be improved a lot if it had this single contribution (or multiple contributions) strongly emphasised, motivated, and supported. Very happy to discuss this question more here, in this PR, as part of the review process.

Thank you @apaleyes for your insightful comments, which have enhanced the content and quality of the paper.
Our ideas relate closely to the latter two questions asked.

We have provided a more unified conclusion, clarifying that the paper is a preliminary study of five datasets with substantial sample sizes, characterised by class imbalance. A justification for the selection of the models has been included in the text; this informed the choice of PFI for the feature selection process, owing to its advantage of being model-agnostic. Expansion to other models and much larger datasets has been noted in the conclusion as further study.
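For context on the model-agnostic point: permutation feature importance only needs a fitted model, a held-out set, and a scorer, so the same call covers Random Forest, LightGBM, or SVM. Below is a minimal sketch using scikit-learn's permutation_importance; the dataset, model, and settings are illustrative, not the paper's.

```python
# Minimal sketch of model-agnostic permutation feature importance (PFI).
# Dataset, model, and settings are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (roughly 9:1 class ratio).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# PFI only requires predictions and a scorer, so the same call works
# unchanged for Random Forest, LightGBM, or SVM estimators.
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=0, n_jobs=-1)
ranked = result.importances_mean.argsort()[::-1]
print("Top features by mean AUC drop:", ranked[:5])
```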


Thank you @apaleyes for the insightful comments on this paper, which have enhanced its quality and richness.
The main contribution of the work borders on the latter two questions you raised. Accordingly, we have provided a more unified conclusion and justified the selection of the models. As this is a preliminary investigation, we have also included as future work an expansion to introduce a quantitative measure of the variability of models and feature selection methods. Other comments have also been addressed.

AmadiGabriel (Collaborator, Author) commented Jul 8, 2024, quoting @janeadams's review:

A succinct and interesting read on evaluating permutation feature importance (PFI) impacts on three different classification models (Random Forest, LightGBM, and SVM) with varying proportions of subsampled data featuring unbalanced classes. I have minor comments, but overall I think this is a great contribution.

  • The dual axes in the processing time figure were odd to me at first; it might be valuable to explain that SVM's poor performance relative to the other two methods is likely due to its poor parallelizability (if that's a word).
  • The "decrease in AUC" figures are confusing in that negative x-axis values must therefore indicate an increase in AUC? (Correct me if I am misunderstanding.) This forces the reader to think about a "double negative makes a positive", which adds possibly unnecessary complexity to interpretation. I would recommend either 1) changing the axis/measure to just be "change in AUC" and/or 2) adding annotations directly onto the white space with an arrow indicating "poorer performance this direction" or similar.

I particularly appreciated the pre-filtering step of using hierarchical clustering of features to account for potential collinearities. I also appreciated that the authors used multiple datasets and evaluated at a range of sample proportions. This is a nice example of how a lot of scientific computing Python libraries can come together into a single interesting experiment.
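For readers unfamiliar with this kind of pre-filtering step, here is a minimal sketch of one way a collinearity pre-filter can look: cluster features on their Spearman correlations and keep one representative per cluster. It uses SciPy/NumPy; the data and the distance threshold are illustrative, not taken from the paper.

```python
# Minimal sketch of a collinearity pre-filter via hierarchical clustering
# of feature correlations. Data and threshold are illustrative only.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=500)  # inject two collinear features

corr = spearmanr(X).correlation      # feature-by-feature Spearman correlation
corr = (corr + corr.T) / 2           # enforce symmetry
np.fill_diagonal(corr, 1.0)

distance = 1 - np.abs(corr)          # strongly correlated features -> small distance
linkage = hierarchy.ward(squareform(distance, checks=False))
cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion="distance")

# Keep one representative feature per cluster.
keep = sorted({cid: idx for idx, cid in enumerate(cluster_ids)}.values())
print("selected feature indices:", keep)
```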

Thank you @janeadams for the observations and review of the paper. These have provided clarity to aspects of the data visualisation and improved the deductions on model performance.

An explanation for SVM's poor performance has been included in the text.
The axis has been changed to "change in AUC". More explanation has been included to clarify positive and negative PFI performance results.
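To make the sign convention concrete, here is a minimal, self-contained sketch of reporting "change in AUC" (AUC with the reduced feature set minus AUC with all features), so that a positive value reads directly as an improvement. The dataset, model, and feature subset are illustrative, not the paper's.

```python
# Minimal sketch of the "change in AUC" framing: positive = improvement.
# Dataset, model, and feature subset are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

selected = [0, 3, 5, 7, 11]  # hypothetical subset chosen by feature selection

auc_full = roc_auc_score(
    y_te,
    RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    .predict_proba(X_te)[:, 1])
auc_sub = roc_auc_score(
    y_te,
    RandomForestClassifier(random_state=0).fit(X_tr[:, selected], y_tr)
    .predict_proba(X_te[:, selected])[:, 1])

print(f"change in AUC: {auc_sub - auc_full:+.4f}")  # positive means better
```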

apaleyes (Collaborator) commented Jul 9, 2024

Lovely, thanks for all the work on updating the paper, @AmadiGabriel! I'll have another look shortly.

apaleyes (Collaborator) commented Aug 7, 2024

"shortly" ha-ha (one month later)

Anyhow, @cbcunc @ameyxd, I am happy with the changes made to the paper and with how the comments were addressed. If I am reading it right, some additional experiments were run; quite impressive!

To the authors: I still think the number of features in the datasets reviewed isn't big enough to justify feature selection. It absolutely works as a first step, but it would be nice to see follow-up work on larger datasets, as the future work paragraph promises. Same conference next year?
